Phishing Link Detection Machine Learning¶

Jonathan Christyadi (502705) - AI Core 02

This notebook aims at predicting the likelihood of a link being a phishing link or a legitimate link with a focus on exploring and testing hypotheses that necessitate further research.

Dataset: https://data.mendeley.com/datasets/c2gw7fy2j4/3

In [ ]:
import sklearn
import pandas as pd
import seaborn
import numpy as np
print("scikit-learn version:", sklearn.__version__)     # 1.1.3
print("pandas version:", pd.__version__)            # 1.5.1
print("seaborn version:", seaborn.__version__)          # 0.12.1
scikit-learn version: 1.4.1.post1
pandas version: 2.2.1
seaborn version: 0.13.2

📦 Data provisioning¶

After loading the dataset, I found out some inconsistencies among the data. First the label of the link (phishing or legitimate) can be changed into binary format. Also, for domain_with_copyright column, some are in binary and some are written in alphabets, for example: zero, One, etc.

In [ ]:
df = pd.read_csv("Data\dataset_link_phishing.csv", sep=',', index_col=False, dtype='unicode')
df.head()
Out[ ]:
id url url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
0 0 http://www.progarchives.com/album.asp?id=61737 46 20 0 3 0 0 1 0 ... 1 one 0 627 6678 78526 0 0 5 phishing
1 1 http://signin.eday.co.uk.ws.edayisapi.dllsign.... 128 120 0 10 0 0 0 0 ... 1 zero 0 300 65 0 0 1 0 phishing
2 2 http://www.avevaconstruction.com/blesstool/ima... 52 25 0 3 0 0 0 0 ... 1 zero 0 119 1707 0 0 1 0 phishing
3 3 http://www.jp519.com/ 21 13 0 2 0 0 0 0 ... 1 one 0 130 1331 0 0 0 0 legitimate
4 4 https://www.velocidrone.com/ 28 19 0 2 0 0 0 0 ... 0 zero 0 164 1662 312044 0 0 4 legitimate

5 rows × 87 columns

In [ ]:
df.sample(5)
Out[ ]:
id url url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
15739 7738 https://cteam-my.sharepoint.com/:o:/g/personal... 126 23 1 2 1 0 1 0 ... 1 0 0 382 8018 0 0 1 4 phishing
3077 3077 http://doc.google.share.pressurecookerindia.co... 150 40 1 5 0 0 1 0 ... 1 zero 0 343 4405 0 0 1 0 phishing
11363 3362 https://www.sonlight.com/ 25 16 1 2 0 0 0 0 ... 0 0 0 1379 7753 140382 0 0 4 legitimate
13001 5000 https://grabyourcode.com/paypal/adder/index.html 48 16 1 2 0 0 0 0 ... 1 0 0 284 1541 2573053 0 0 0 phishing
7827 7827 http://www.acostamueble.com/img/ 32 20 0 2 0 0 0 0 ... 1 zero 0 888 5321 0 0 1 2 phishing

5 rows × 87 columns

In [ ]:
columns = df.columns.tolist()

with open("output.txt", "w") as file:
    for column in columns:
        file.write(column + "\n")
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19431 entries, 0 to 19430
Data columns (total 85 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   url_length                  19431 non-null  int64  
 1   hostname_length             19431 non-null  int64  
 2   ip                          19431 non-null  object 
 3   total_of.                   19431 non-null  int64  
 4   total_of-                   19431 non-null  int64  
 5   total_of@                   19431 non-null  object 
 6   total_of?                   19431 non-null  int64  
 7   total_of&                   19431 non-null  object 
 8   total_of=                   19431 non-null  object 
 9   total_of_                   19431 non-null  object 
 10  total_of~                   19431 non-null  object 
 11  total_of%                   19431 non-null  object 
 12  total_of/                   19431 non-null  int64  
 13  total_of*                   19431 non-null  object 
 14  total_of:                   19431 non-null  object 
 15  total_of,                   19431 non-null  object 
 16  total_of;                   19431 non-null  object 
 17  total_of$                   19431 non-null  object 
 18  total_of_www                19431 non-null  int64  
 19  total_of_com                19431 non-null  object 
 20  total_of_http_in_path       19431 non-null  object 
 21  https_token                 19431 non-null  object 
 22  ratio_digits_url            19431 non-null  float64
 23  ratio_digits_host           19431 non-null  object 
 24  punycode                    19431 non-null  object 
 25  port                        19431 non-null  object 
 26  tld_in_path                 19431 non-null  object 
 27  tld_in_subdomain            19431 non-null  object 
 28  abnormal_subdomain          19431 non-null  object 
 29  nb_subdomains               19431 non-null  object 
 30  prefix_suffix               19431 non-null  object 
 31  random_domain               19431 non-null  object 
 32  shortening_service          19431 non-null  object 
 33  path_extension              19431 non-null  object 
 34  nb_redirection              19431 non-null  object 
 35  nb_external_redirection     19431 non-null  object 
 36  length_words_raw            19431 non-null  object 
 37  char_repeat                 19431 non-null  object 
 38  shortest_words_raw          19431 non-null  object 
 39  shortest_word_host          19431 non-null  object 
 40  shortest_word_path          19431 non-null  object 
 41  longest_words_raw           19431 non-null  object 
 42  longest_word_host           19431 non-null  object 
 43  longest_word_path           19431 non-null  object 
 44  avg_words_raw               19431 non-null  object 
 45  avg_word_host               19431 non-null  object 
 46  avg_word_path               19431 non-null  object 
 47  phish_hints                 19431 non-null  int64  
 48  domain_in_brand             19431 non-null  object 
 49  brand_in_subdomain          19431 non-null  object 
 50  brand_in_path               19431 non-null  object 
 51  suspecious_tld              19431 non-null  object 
 52  statistical_report          19431 non-null  object 
 53  nb_hyperlinks               19431 non-null  int64  
 54  ratio_intHyperlinks         19431 non-null  object 
 55  ratio_extHyperlinks         19431 non-null  object 
 56  ratio_nullHyperlinks        19431 non-null  object 
 57  nb_extCSS                   19431 non-null  object 
 58  ratio_intRedirection        19431 non-null  object 
 59  ratio_extRedirection        19431 non-null  object 
 60  ratio_intErrors             19431 non-null  object 
 61  ratio_extErrors             19431 non-null  object 
 62  login_form                  19431 non-null  object 
 63  external_favicon            19431 non-null  object 
 64  links_in_tags               19431 non-null  object 
 65  submit_email                19431 non-null  object 
 66  ratio_intMedia              19431 non-null  object 
 67  ratio_extMedia              19431 non-null  object 
 68  sfh                         19431 non-null  object 
 69  iframe                      19431 non-null  object 
 70  popup_window                19431 non-null  object 
 71  safe_anchor                 19431 non-null  object 
 72  onmouseover                 19431 non-null  object 
 73  right_clic                  19431 non-null  object 
 74  empty_title                 19431 non-null  object 
 75  domain_in_title             19431 non-null  int64  
 76  domain_with_copyright       19431 non-null  int32  
 77  whois_registered_domain     19431 non-null  object 
 78  domain_registration_length  19431 non-null  object 
 79  domain_age                  19431 non-null  object 
 80  web_traffic                 19431 non-null  object 
 81  dns_record                  19431 non-null  object 
 82  google_index                19431 non-null  int64  
 83  page_rank                   19431 non-null  int64  
 84  status                      19431 non-null  int64  
dtypes: float64(1), int32(1), int64(13), object(70)
memory usage: 12.5+ MB
In [ ]:
# Sampling the dataset

df.sample(10)
Out[ ]:
id url url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
12368 4367 http://bridgeburglar.com/bridge-burglars-guide... 77 17 1 1 6 0 0 0 ... 1 1 0 161 2761 4459552 0 0 1 legitimate
2116 2116 http://sanangelo.iconcinemas.com/ 33 25 0 2 0 0 0 0 ... 1 zero 0 135 3153 9482009 0 0 3 legitimate
14707 6706 http://nintendo.wikia.com/wiki/Nintendo_Switch 46 18 1 2 0 0 0 0 ... 1 1 0 140 6070 14420 0 0 5 legitimate
969 969 https://www.justice.gov/atr/blame-switchman-ru... 90 15 0 2 7 0 0 0 ... 1 one 0 0 -1 4382 0 0 6 legitimate
8546 545 https://www.azurepower.com/ 27 18 1 2 0 0 0 0 ... 1 1 0 1119 4724 942542 0 0 4 legitimate
2176 2176 https://mail.parkhill.k12.mo.us/owa/auth/logon... 123 23 0 9 0 0 1 1 ... 1 zero 1 0 -1 105946 0 1 4 phishing
15179 7178 https://login.microsoftonline.com/decee90c-ce0... 557 25 1 5 24 0 1 9 ... 1 1 0 350 6589 30 0 1 4 legitimate
17788 9787 http://www.payscale.com/research/US/Job=Magnet... 97 16 1 2 0 0 0 0 ... 0 0 0 1290 7841 3990 0 0 5 legitimate
4291 4291 http://starmak.com.tr/950CAAEA0281AA2BEBED8F9E... 76 14 1 2 0 0 1 0 ... 1 zero 0 0 4376 0 0 1 1 phishing
3559 3559 https://s.free.fr/92rsZcB4 26 9 0 2 0 0 0 0 ... 1 zero 0 518 7800 2868149 0 1 5 phishing

10 rows × 87 columns

Preprocessing¶

🆔 Encoding¶

In [ ]:
df['status'] = df['status'].map({'phishing': 1, 'legitimate': 0})
df.head()
Out[ ]:
id url url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
0 0 http://www.progarchives.com/album.asp?id=61737 46 20 0 3 0 0 1 0 ... 1 one 0 627 6678 78526 0 0 5 1
1 1 http://signin.eday.co.uk.ws.edayisapi.dllsign.... 128 120 0 10 0 0 0 0 ... 1 zero 0 300 65 0 0 1 0 1
2 2 http://www.avevaconstruction.com/blesstool/ima... 52 25 0 3 0 0 0 0 ... 1 zero 0 119 1707 0 0 1 0 1
3 3 http://www.jp519.com/ 21 13 0 2 0 0 0 0 ... 1 one 0 130 1331 0 0 0 0 0
4 4 https://www.velocidrone.com/ 28 19 0 2 0 0 0 0 ... 0 zero 0 164 1662 312044 0 0 4 0

5 rows × 87 columns

In [ ]:
df['domain_with_copyright'] = df['domain_with_copyright'].map({'one': 1, 'zero': 0, 'Zero': 0, 'One': 1,'1': 1, '0': 0}).astype(int)
df['domain_with_copyright'].unique()
Out[ ]:
array([1, 0])

Checking null or NaN values¶

In [ ]:
# Calculate the total number of missing values in the DataFrame
total_na = df.isna().sum()
In [ ]:
# Calculate the total number of missing values in the DataFrame
total_null = df.isnull().sum()
total_null.sum()
Out[ ]:
0
In [ ]:
# Finding columns with binary values

def count_binary_columns(df):
    results = []
    counter = 0
    for col in df.columns:
        counter += 1
        if df[col].isin([0, 1]).all():
            results.append(col)
    return results, counter


count_binary_columns(df)
Out[ ]:
(['domain_in_title', 'domain_with_copyright', 'google_index', 'status'], 85)
In [ ]:
df = df.drop(columns=['id', 'url'])
df.head()
Out[ ]:
url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& total_of= total_of_ ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
0 46 20 0 3 0 0 1 0 1 0 ... 1 1 0 627 6678 78526 0 0 5 1
1 128 120 0 10 0 0 0 0 0 0 ... 1 0 0 300 65 0 0 1 0 1
2 52 25 0 3 0 0 0 0 0 0 ... 1 0 0 119 1707 0 0 1 0 1
3 21 13 0 2 0 0 0 0 0 0 ... 1 1 0 130 1331 0 0 0 0 0
4 28 19 0 2 0 0 0 0 0 0 ... 0 0 0 164 1662 312044 0 0 4 0

5 rows × 85 columns

In [ ]:
df['whois_registered_domain'].unique()
Out[ ]:
array(['0', '1'], dtype=object)
In [ ]:
print(df['status'].value_counts())
df['status'].value_counts().plot(kind='bar', title='Count the target variable')    
status
0    9716
1    9715
Name: count, dtype: int64
Out[ ]:
<Axes: title={'center': 'Count the target variable'}, xlabel='status'>

💡 Feature selection¶

A heatmap will be used to select a suitable set of features to predict the status target upon. At this stage, I have zero idea which feature to use and I utilized heatmap to find features with the most corellation with the target feature.

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()
plt.figure(figsize=(100, 100))
plot = sns.heatmap(corr, annot=True, fmt='.2f', linewidths=2)
In [ ]:
# Sorting the correlation values with the target variable in descending order
corr.drop('status').sort_values(by='status', ascending=False).plot.bar(y='status', title='Correlation with the target variable', figsize=(20, 10))
Out[ ]:
<Axes: title={'center': 'Correlation with the target variable'}>
In [ ]:
# Finding the most correlated features with the target variable based on numerical featrures excluding NaN values
correlation_matrix = df.corr(numeric_only=True)
sorted_corr = correlation_matrix.sort_values(by='status',ascending=False)
sorted_corr
Out[ ]:
url_length hostname_length total_of. total_of- total_of? total_of/ total_of_www ratio_digits_url phish_hints nb_hyperlinks domain_in_title domain_with_copyright google_index page_rank status
status 0.244348 0.240681 0.205302 -0.102849 0.293920 0.240892 -0.444561 0.356587 0.337287 -0.341295 0.339519 -0.175469 0.730684 -0.509761 1.000000
google_index 0.233061 0.216919 0.208764 -0.018285 0.202097 0.289212 -0.357215 0.323157 0.279906 -0.269482 0.265933 -0.144499 1.000000 -0.386721 0.730684
ratio_digits_url 0.434626 0.171761 0.224194 0.110341 0.325739 0.206925 -0.211165 1.000000 0.096967 -0.128915 0.152393 -0.027357 0.323157 -0.181489 0.356587
domain_in_title 0.124224 0.218850 0.108442 0.009843 0.092191 0.088462 -0.178402 0.152393 0.125857 -0.217548 1.000000 0.076105 0.265933 -0.332742 0.339519
phish_hints 0.332000 -0.019901 0.168765 0.065562 0.208052 0.501321 -0.090812 0.096967 1.000000 -0.112423 0.125857 -0.066130 0.279906 -0.203464 0.337287
total_of? 0.523172 0.164129 0.353133 0.035958 1.000000 0.243749 -0.115337 0.325739 0.208052 -0.112604 0.092191 -0.046123 0.202097 -0.123151 0.293920
url_length 1.000000 0.217586 0.447198 0.406951 0.523172 0.486490 -0.067973 0.434626 0.332000 -0.098101 0.124224 -0.004281 0.233061 -0.099900 0.244348
total_of/ 0.486490 -0.061203 0.242216 0.204793 0.243749 1.000000 -0.005628 0.206925 0.501321 -0.073183 0.088462 -0.023213 0.289212 -0.113861 0.240892
hostname_length 0.217586 1.000000 0.406834 0.059480 0.164129 -0.061203 -0.130991 0.171761 -0.019901 -0.104614 0.218850 0.073107 0.216919 -0.160621 0.240681
total_of. 0.447198 0.406834 1.000000 0.049303 0.353133 0.242216 0.068290 0.224194 0.168765 -0.093994 0.108442 0.057320 0.208764 -0.098752 0.205302
total_of- 0.406951 0.059480 0.049303 1.000000 0.035958 0.204793 0.045756 0.110341 0.065562 -0.004513 0.009843 0.020914 -0.018285 0.104676 -0.102849
domain_with_copyright -0.004281 0.073107 0.057320 0.020914 -0.046123 -0.023213 0.087826 -0.027357 -0.066130 0.192159 0.076105 1.000000 -0.144499 0.057127 -0.175469
nb_hyperlinks -0.098101 -0.104614 -0.093994 -0.004513 -0.112604 -0.073183 0.114259 -0.128915 -0.112423 1.000000 -0.217548 0.192159 -0.269482 0.221066 -0.341295
total_of_www -0.067973 -0.130991 0.068290 0.045756 -0.115337 -0.005628 1.000000 -0.211165 -0.090812 0.114259 -0.178402 0.087826 -0.357215 0.110745 -0.444561
page_rank -0.099900 -0.160621 -0.098752 0.104676 -0.123151 -0.113861 0.110745 -0.181489 -0.203464 0.221066 -0.332742 0.057127 -0.386721 1.000000 -0.509761
In [ ]:
# Get all the correlated features with the target variable
num_features = len(sorted_corr['status']) # 15 features
sorted_corr['status'].head(num_features)
Out[ ]:
status                   1.000000
google_index             0.730684
ratio_digits_url         0.356587
domain_in_title          0.339519
phish_hints              0.337287
total_of?                0.293920
url_length               0.244348
total_of/                0.240892
hostname_length          0.240681
total_of.                0.205302
total_of-               -0.102849
domain_with_copyright   -0.175469
nb_hyperlinks           -0.341295
total_of_www            -0.444561
page_rank               -0.509761
Name: status, dtype: float64
In [ ]:
# List the features from the previous step into a list
# selected_features = ['google_index', 'ratio_digits_url', 'domain_in_title', 'phish_hints', 'total_of?', 'url_length', 'total_of/','hostname_length','total_of.', 'total_of-','domain_with_copyright','nb_hyperlinks','total_of_www','page_rank']
selected_features = sorted_corr['status'].head(num_features).index.tolist()
df[selected_features] = df[selected_features].apply(pd.to_numeric, errors='coerce')

# Check the data types of the selected columns after conversion
print(df[selected_features].dtypes)

# Check if 'status' column exists and has categorical or numerical data
print(df['status'].dtype)

# Create a DataFrame with the selected columns
selected_df = df[selected_features + ['status']]
selected_df.head()
google_index               int64
ratio_digits_url         float64
domain_in_title            int64
phish_hints                int64
total_of?                  int64
url_length                 int64
total_of/                  int64
hostname_length            int64
total_of.                  int64
total_of-                  int64
domain_with_copyright      int32
nb_hyperlinks              int64
total_of_www               int64
page_rank                  int64
dtype: object
int64
Out[ ]:
google_index ratio_digits_url domain_in_title phish_hints total_of? url_length total_of/ hostname_length total_of. total_of- domain_with_copyright nb_hyperlinks total_of_www page_rank status
0 0 0.108696 1 0 1 46 3 20 3 0 1 143 1 5 1
1 1 0.054688 1 2 0 128 3 120 10 0 0 0 0 0 1
2 1 0.000000 1 0 0 52 4 25 3 0 0 3 1 0 1
3 0 0.142857 1 0 0 21 3 13 2 0 1 404 1 0 0
4 0 0.000000 0 0 0 28 3 19 2 0 0 57 1 4 0
In [ ]:
# Count the number of binary columns in the selected features

features_binary = count_binary_columns(df[selected_features])
features_binary
Out[ ]:
['status',
 'google_index',
 'ratio_digits_url',
 'domain_in_title',
 'phish_hints',
 'total_of?',
 'url_length',
 'total_of/',
 'hostname_length',
 'total_of.',
 'total_of-',
 'domain_with_copyright',
 'nb_hyperlinks',
 'total_of_www',
 'page_rank']
In [ ]:
from sklearn.preprocessing import StandardScaler
# Scale the data
selected_df = selected_df.dropna()
scaler = StandardScaler()
selected_df[selected_features] = scaler.fit_transform(selected_df[selected_features])
In [ ]:
from pandas.plotting import scatter_matrix
scatter_matrix(selected_df, alpha=1, figsize=(60, 60), diagonal='hist')
plt.show()
In [ ]:
# Create pairplot
sns.pairplot(selected_df, hue='status', palette='Set1')

# Show the plot
plt.show()
In [ ]:
target = 'status'

X = df[selected_features]
y = df[target]

🪓 Splitting into train/test¶

In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 19431 observations, of which 15544 are now in the train set, and 3887 in the test set.

🧬 Modelling¶

Support Vector Machine¶

In [ ]:
# SUPPORT VECTOR MACHINE SVM
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.8422948289169025
In [ ]:
from sklearn.metrics import classification_report
predictions = model.predict(X_test)
report = classification_report(y_test, predictions)
print(report)
              precision    recall  f1-score   support

           0       0.87      0.82      0.84      1982
           1       0.82      0.87      0.84      1905

    accuracy                           0.84      3887
   macro avg       0.84      0.84      0.84      3887
weighted avg       0.84      0.84      0.84      3887

Linear Regression¶

In [ ]:
# LINEAR REGRESSION

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("R²:", score)
R²: 0.6897128732856885
In [ ]:
import shap

# Shap explainer initialized with the model and training data
explainer = shap.Explainer(model, X_train)

# Calculate Shap values for the predictions made on the test set
shap_values = explainer.shap_values(X_test)

# Plot the Shap values using bee swarm plot
shap.summary_plot(shap_values, X_test)
IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

🏘️ K-NEAREST NEIGBOURS¶

In [ ]:
# K-NEAREST NEIGHBORS

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=4)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.9114998713660921

🌲Decision Tree¶

In [ ]:
# DECISION TREE

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(min_samples_leaf=40, min_samples_split=300)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.9336249035245691
In [ ]:
target_names = ["phishing", "legitimate"]
import matplotlib.pyplot as plt
plt.figure(figsize=(40,40))
from sklearn.tree import plot_tree
plot_tree(model, fontsize=8, feature_names=selected_features, class_names=target_names)
plt.show()